successful attack
- North America > United States (1.00)
- Africa > South Africa (0.04)
- Europe > Latvia > Lubāna Municipality > Lubāna (0.04)
- (5 more...)
- Research Report > Experimental Study (1.00)
- Workflow (0.67)
- Research Report > New Finding (0.67)
- Media (1.00)
- Law > Civil Rights & Constitutional Law (1.00)
- Information Technology > Security & Privacy (1.00)
- (8 more...)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.92)
PolyJuice Makes It Real: Black-Box, Universal Red Teaming for Synthetic Image Detectors
Dehdashtian, Sepehr, Morshed, Mashrur M., Seidman, Jacob H., Bharaj, Gaurav, Boddeti, Vishnu Naresh
Synthetic image detectors (SIDs) are a key defense against the risks posed by the growing realism of images from text-to-image (T2I) models. Red teaming improves SIDs' effectiveness by identifying and exploiting their failure modes via misclassified synthetic images. However, existing red-teaming solutions (i) require white-box access to SIDs, which is infeasible for proprietary state-of-the-art detectors, and (ii) generate image-specific attacks through expensive online optimization. To address these limitations, we propose PolyJuice, the first black-box, image-agnostic red-teaming method for SIDs, based on an observed distribution shift in the T2I latent space between samples correctly and incorrectly classified by the SID. PolyJuice generates attacks by (i) identifying the direction of this shift through a lightweight offline process that only requires black-box access to the SID, and (ii) exploiting this direction by universally steering all generated images towards the SID's failure modes. PolyJuice-steered T2I models are significantly more effective at deceiving SIDs (up to 84%) compared to their unsteered counterparts. We also show that the steering directions can be estimated efficiently at lower resolutions and transferred to higher resolutions using simple interpolation, reducing computational overhead. Finally, tuning SID models on PolyJuice-augmented datasets notably enhances the performance of the detectors (up to 30%).
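Read as an algorithm, the abstract suggests a two-step recipe: estimate a single latent-space direction from black-box SID decisions, then apply it to every new generation. The sketch below is a hypothetical illustration of that idea, not the paper's implementation; all names, the toy data, and the choice of a mean-difference estimator are assumptions.

```python
import numpy as np

def estimate_steering_direction(latents, sid_fooled):
    """Offline, black-box estimate of a universal steering direction.

    latents    : (N, D) array of T2I latents for generated images.
    sid_fooled : (N,) boolean array, True where the SID misclassified the
                 corresponding synthetic image as real.
    """
    mu_fooled = latents[sid_fooled].mean(axis=0)
    mu_detected = latents[~sid_fooled].mean(axis=0)
    direction = mu_fooled - mu_detected
    return direction / (np.linalg.norm(direction) + 1e-12)

def steer(latent, direction, strength=1.0):
    """Apply the same image-agnostic shift to any new latent before decoding."""
    return latent + strength * direction

# Toy usage with random data standing in for real T2I latents and SID labels.
rng = np.random.default_rng(0)
latents = rng.normal(size=(512, 64))
sid_fooled = rng.random(512) < 0.2          # placeholder SID decisions
direction = estimate_steering_direction(latents, sid_fooled)
steered = steer(latents[0], direction, strength=2.0)
```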
- Europe > Switzerland > Zürich > Zürich (0.14)
- North America > United States > Michigan (0.04)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
- Asia > Japan > Honshū > Chūbu > Nagano Prefecture > Nagano (0.04)
- Information Technology > Security & Privacy (0.95)
- Transportation > Air (0.82)
Mind the Gap: Evaluating Model- and Agentic-Level Vulnerabilities in LLMs with Action Graphs
Wicaksono, Ilham, Wu, Zekun, Patel, Rahul, King, Theo, Koshiyama, Adriano, Treleaven, Philip
As large language models transition to agentic systems, current safety evaluation frameworks face critical gaps in assessing deployment-specific risks. We introduce AgentSeer, an observability-based evaluation framework that decomposes agentic executions into granular action and component graphs, enabling systematic agentic-situational assessment. Through cross-model validation on GPT-OSS-20B and Gemini-2.0-flash using HarmBench single-turn and iterative refinement attacks, we demonstrate fundamental differences between model-level and agentic-level vulnerability profiles. Model-level evaluation reveals baseline differences: GPT-OSS-20B (39.47% ASR) versus Gemini-2.0-flash (50.00% ASR), with both models showing susceptibility to social engineering while maintaining resistance to logic-based attacks. However, agentic-level assessment exposes agent-specific risks invisible to traditional evaluation. We discover "agentic-only" vulnerabilities that emerge exclusively in agentic contexts, with tool-calling showing 24-60% higher ASR across both models. Cross-model analysis reveals universal agentic patterns: agent transfer operations as the highest-risk tools, semantic rather than syntactic vulnerability mechanisms, and context-dependent attack effectiveness, alongside model-specific security profiles in absolute ASR levels and optimal injection strategies. Direct attack transfer from model-level to agentic contexts shows degraded performance (GPT-OSS-20B: 57% human injection ASR; Gemini-2.0-flash: 28%), while context-aware iterative attacks successfully compromise objectives that failed at the model level, confirming systematic evaluation gaps. These findings establish the urgent need for agentic-situational evaluation paradigms, with AgentSeer providing the standardized methodology and empirical validation.
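The core idea of decomposing an agentic execution into an action graph and scoring attack success per action type can be illustrated with a minimal, hypothetical sketch. The node fields and traversal below are assumptions for illustration, not AgentSeer's actual data model or API.

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    """One node in a simplified action graph: a single step of an agent run."""
    name: str                      # e.g. "tool_call:send_email"
    kind: str                      # "llm", "tool_call", "handoff", ...
    attacked: bool = False         # was an attack injected at this step?
    compromised: bool = False      # did the attack achieve its objective?
    children: list = field(default_factory=list)

def attack_success_rate(root, kind=None):
    """Walk the graph and compute ASR, optionally restricted to one action kind."""
    attacked = succeeded = 0
    stack = [root]
    while stack:
        node = stack.pop()
        if node.attacked and (kind is None or node.kind == kind):
            attacked += 1
            succeeded += node.compromised
        stack.extend(node.children)
    return succeeded / attacked if attacked else 0.0

# Toy trace: a planner step that fans out into two attacked tool calls.
plan = Action("plan", "llm")
plan.children = [
    Action("tool_call:search", "tool_call", attacked=True, compromised=True),
    Action("tool_call:send_email", "tool_call", attacked=True, compromised=False),
]
print(attack_success_rate(plan, kind="tool_call"))  # 0.5
```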
Supplementary Materials of Random Noise Defense against Query-Based Black-Box Attacks
Qin, Zeyu, Fan, Yanbo
In this supplementary document, we provide additional materials to supplement our main submission. In Section A, we discuss the societal impacts of our work. In Section B, we provide detailed experimental settings as well as further evaluation results on CIFAR-10 and ImageNet. In Section C, we provide the analysis and evaluation of decision-based attacks. In Sections D and E, we give the proofs of our theoretical results, and the proofs of Theorem 3 are given in Section F.
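For context, the defense named in the title can be read as perturbing every incoming query with small random noise before inference, which degrades the gradient estimates used by query-based black-box attacks. The snippet below is a minimal sketch under that assumption; the signature, noise scale, and toy model are all made up, not the paper's code.

```python
import numpy as np

def rnd_predict(model, x, sigma=0.02, rng=None):
    """Random-noise-defended prediction: perturb each query before inference.

    model : callable mapping a batch of inputs to logits.
    x     : (B, ...) batch of query inputs with values in [0, 1].
    sigma : standard deviation of the Gaussian query noise (assumed value).
    """
    rng = rng or np.random.default_rng()
    noisy = np.clip(x + rng.normal(scale=sigma, size=x.shape), 0.0, 1.0)
    return model(noisy)

# Toy usage: a linear "model" on flattened CIFAR-sized inputs.
rng = np.random.default_rng(0)
W = rng.normal(size=(10, 32 * 32 * 3))
model = lambda batch: batch.reshape(len(batch), -1) @ W.T
x = rng.random((4, 32, 32, 3))
logits = rnd_predict(model, x, sigma=0.02, rng=rng)
```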
- Asia > China > Guangdong Province > Shenzhen (0.04)
- Asia > Middle East > Jordan (0.04)
- Asia > China > Hong Kong (0.04)
- Information Technology (0.47)
- Transportation > Air (0.42)
Probing and Enhancing the Robustness of GNN-based QEC Decoders with Reinforcement Learning
Graph Neural Networks (GNNs) have emerged as a powerful, data-driven approach for Quantum Error Correction (QEC) decoding, capable of learning complex noise characteristics directly from syndrome data. However, the robustness of these decoders against subtle, adversarial perturbations remains a critical open question. This work introduces a novel framework to systematically probe the vulnerabilities of a GNN decoder using a reinforcement learning (RL) agent. The RL agent is trained as an adversary with the goal of finding minimal syndrome modifications that cause the decoder to misclassify. We apply this framework to a Graph Attention Network (GAT) decoder trained on experimental surface code data from Google Quantum AI. Our results show that the RL agent can successfully identify specific, critical vulnerabilities, achieving a high attack success rate with a minimal number of bit flips. Furthermore, we demonstrate that the decoder's robustness can be significantly enhanced through adversarial training, where the model is retrained on the adversarial examples generated by the RL agent. This iterative process of automated vulnerability discovery and targeted retraining presents a promising methodology for developing more reliable and robust neural network decoders for fault-tolerant quantum computing.
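The attack-and-retrain loop described above can be illustrated with a small stand-in for the RL adversary. The greedy search below is only a hypothetical sketch of "find minimal syndrome bit flips that change the decoder's output"; it is not the paper's agent, and the toy decoder and names are assumptions.

```python
import numpy as np

def greedy_minimal_flip_attack(decoder, syndrome, max_flips=3):
    """Stand-in for the RL adversary: flip syndrome bits until the decoder's
    predicted logical label changes, or the flip budget is exhausted.

    decoder  : callable mapping a binary syndrome vector to a label in {0, 1}.
    syndrome : 1-D binary numpy array of syndrome measurements.
    """
    original = decoder(syndrome)
    attacked = syndrome.copy()
    for _ in range(max_flips):
        # Try every single-bit flip from the current state and return the
        # first one that changes the decoder's output.
        for i in range(len(attacked)):
            trial = attacked.copy()
            trial[i] ^= 1
            if decoder(trial) != original:
                return trial  # adversarial syndrome found
        # No single flip works from here; commit one random flip and retry.
        attacked[np.random.randint(len(attacked))] ^= 1
    return None  # no successful attack within the budget

# Toy decoder: parity of the syndrome, so any single flip succeeds.
decoder = lambda s: int(s.sum() % 2)
adversarial = greedy_minimal_flip_attack(decoder, np.zeros(16, dtype=int))
```

Adversarial training in this setting would simply add the returned syndromes, with their correct labels, back into the decoder's training set and retrain.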
LLMail-Inject: A Dataset from a Realistic Adaptive Prompt Injection Challenge
Abdelnabi, Sahar, Fay, Aideen, Salem, Ahmed, Zverev, Egor, Liao, Kai-Chieh, Liu, Chi-Huang, Kuo, Chun-Chih, Weigend, Jannis, Manlangit, Danyael, Apostolov, Alex, Umair, Haris, Donato, João, Kawakita, Masayuki, Mahboob, Athar, Bach, Tran Huu, Chiang, Tsun-Han, Cho, Myeongjin, Choi, Hajin, Kim, Byeonghyeon, Lee, Hyeonjin, Pannell, Benjamin, McCauley, Conor, Russinovich, Mark, Paverd, Andrew, Cherubin, Giovanni
Indirect Prompt Injection attacks exploit the inherent limitation of Large Language Models (LLMs) in distinguishing between instructions and data in their inputs. Despite numerous defense proposals, systematic evaluation against adaptive adversaries remains limited, even though successful attacks can have broad security and privacy implications and many real-world LLM-based applications remain vulnerable. We present the results of LLMail-Inject, a public challenge simulating a realistic scenario in which participants adaptively attempted to inject malicious instructions into emails in order to trigger unauthorized tool calls in an LLM-based email assistant. The challenge spanned multiple defense strategies, LLM architectures, and retrieval configurations, resulting in a dataset of 208,095 unique attack submissions from 839 participants. We release the challenge code, the full dataset of submissions, and our analysis demonstrating how this data can provide new insights into the instruction-data separation problem. We hope this will serve as a foundation for future research towards practical structural solutions to prompt injection.
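The success condition implied by the abstract boils down to whether reading an attacker-controlled email makes the assistant issue a tool call the user never asked for. The check below is a minimal, hypothetical illustration; the tool names and data shapes are placeholders, not the challenge's actual scoring code.

```python
def attack_succeeded(tool_calls, user_requested, target_tool="send_email"):
    """Return True if the assistant issued the attacker's target tool call
    even though the user never asked for it.

    tool_calls     : list of tool names the assistant actually invoked after
                     processing the (attacker-controlled) email.
    user_requested : set of tool names the legitimate user asked for.
    """
    return target_tool in tool_calls and target_tool not in user_requested

# Toy example: the user asked only for a summary, but the assistant also
# invoked send_email after reading the injected email.
print(attack_succeeded(["summarize", "send_email"], {"summarize"}))  # True
```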
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Europe > Portugal > Coimbra > Coimbra (0.04)
Bypassing Safety Guardrails in LLMs Using Humor
In this paper, we show that it is possible to bypass the safety guardrails of large language models (LLMs) through a humorous prompt that includes the unsafe request. In particular, our method does not edit the unsafe request and follows a fixed template -- it is simple to implement and does not need additional LLMs to craft prompts. Extensive experiments show the effectiveness of our method across different LLMs. We also show that both removing humor from and adding more humor to our method can reduce its effectiveness -- excessive humor possibly distracts the LLM from fulfilling the unsafe request. Thus, we argue that LLM jailbreaking occurs when there is a proper balance between focus on the unsafe request and the presence of humor.
- North America > Canada > Ontario > Toronto (0.04)
- Asia > Thailand > Bangkok > Bangkok (0.04)
- North America > United States > New York (0.04)
- (2 more...)
Attacking Vision-Language Computer Agents via Pop-ups
Zhang, Yanzhe, Yu, Tao, Yang, Diyi
Autonomous agents powered by large vision-language models (VLMs) have demonstrated significant potential in completing daily computer tasks, such as browsing the web to book travel and operating desktop software, which requires agents to understand these interfaces. Despite such visual inputs becoming more integrated into agentic applications, what types of risks and attacks exist around them still remains unclear. In this work, we demonstrate that VLM agents can be easily attacked by a set of carefully designed adversarial pop-ups, which human users would typically recognize and ignore. This distraction leads agents to click these pop-ups instead of performing the tasks as usual. Integrating these pop-ups into existing agent testing environments like OSWorld and VisualWebArena leads to an attack success rate (the frequency of the agent clicking the pop-ups) of 86% on average and decreases the task success rate by 47%. Basic defense techniques, such as asking the agent to ignore pop-ups or including an advertisement notice, are ineffective against the attack.
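The two headline numbers reported here (attack success rate as the frequency of pop-up clicks, and the resulting task success rate) can be computed from logged episodes with a short helper. The dictionary keys below are hypothetical and not part of OSWorld or VisualWebArena; this is only a sketch of the metric definitions given in the abstract.

```python
def popup_attack_metrics(episodes):
    """Compute attack success rate and task success rate from logged episodes.

    episodes : list of dicts with (hypothetical) keys
               "clicked_popup" (bool) and "task_completed" (bool).
    """
    n = len(episodes)
    asr = sum(e["clicked_popup"] for e in episodes) / n
    task_success = sum(e["task_completed"] for e in episodes) / n
    return asr, task_success

# Toy log of three episodes.
log = [
    {"clicked_popup": True, "task_completed": False},
    {"clicked_popup": True, "task_completed": False},
    {"clicked_popup": False, "task_completed": True},
]
print(popup_attack_metrics(log))  # (0.666..., 0.333...)
```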
- North America > United States > California > Santa Clara County > San Jose (0.04)
- North America > United States > Ohio > Franklin County > Columbus (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- (2 more...)
- Information Technology > Security & Privacy (1.00)
- Government (0.68)